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Abstract 


This document presents a conservative loss recovery algorithm for TCP 
that is based on the use of the selective acknowledgment (SACK) TCP 
option. The algorithm presented in this document conforms to the 
spirit of the current congestion control specification (RFC 2581), 
but allows TCP senders to recover more effectively when multiple 
segments are lost from a single flight of data. 


Terminology 
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 
"SHOULD", “SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this 


document are to be interpreted as described in BCP 14, RFC 2119 
[RFC2119]. 
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T 


Introduction 


This document presents a conservative loss recovery algorithm for TCP 
that is based on the use of the selective acknowledgment (SACK) TCP 
option. While the TCP SACK [RFC2018] is being steadily deployed in 
the Internet [A1100], there is evidence that hosts are not using the 
SACK information when making retransmission and congestion control 
decisions [PF01]. The goal of this document is to outline one 
straightforward method for TCP implementations to use SACK 
information to increase performance. 


[RFC2581] allows advanced loss recovery algorithms to be used by TCP 
[RFC793] provided that they follow the spirit of TCP’s congestion 
control algorithms [RFC2581, RFC2914]. [RFC2582] outlines one such 
advanced recovery algorithm called NewReno. This document outlines a 
loss recovery algorithm that uses the SACK [RFC2018] TCP option to 
enhance TCP’s loss recovery. The algorithm outlined in this 
document, heavily based on the algorithm detailed in [FF96], is a 
conservative replacement of the fast recovery algorithm [Jac90, 
RFC2581]. The algorithm specified in this document is a 
straightforward SACK-based loss recovery strategy that follows the 
guidelines set in [RFC2581] and can safely be used in TCP 
implementations. Alternate SACK-based loss recovery methods can be 
used in TCP as implementers see fit (as long as the alternate 
algorithms follow the guidelines provided in [RFC2581]). Please 
note, however, that the SACK-based decisions in this document (such 
as what segments are to be sent at what time) are largely decoupled 
from the congestion control algorithms, and as such can be treated as 
separate issues if so desired. 


Definitions 


The reader is expected to be familiar with the definitions given in 
[RFC2581]. 


The reader is assumed to be familiar with selective acknowledgments 
as specified in [RFC2018]. 


For the purposes of explaining the SACK-based loss recovery algorithm 
we define four variables that a TCP sender stores: 


"HighACK" is the sequence number of the highest byte of data that 
has been cumulatively ACKed at a given point. 


"HighData" is the highest sequence number transmitted at a given 
point. 
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"HighRxt"™ is the highest sequence number which has been 
retransmitted during the current loss recovery phase. 


"Pipe" is a sender’s estimate of the number of bytes outstanding 
in the network. This is used during recovery for limiting the 
sender’s sending rate. The pipe variable allows TCP to use a 
fundamentally different congestion control than specified in 
[RFC2581]. The algorithm is often referred to as the "pipe 
algorithm". 


For the purposes of this specification we define a "duplicate 
acknowledgment" as a segment that arrives with no data and an 
acknowledgment (ACK) number that is equal to the current value of 
HighACK, as described in [RFC2581]. 


We define a variable "DupThresh" that holds the number of duplicate 
acknowledgments required to trigger a retransmission. Per [RFC2581] 
this threshold is defined to be 3 duplicate acknowledgments. 
However, implementers should consult any updates to [RFC2581] to 
determine the current value for DupThresh (or method for determining 
its value). 


Finally, a range of sequence numbers [A,B] is said to "cover" 
sequence number S if A <= S <= B. 


3 Keeping Track of SACK Information 


For a TCP sender to implement the algorithm defined in the next 
section it must keep a data structure to store incoming selective 
acknowledgment information on a per connection basis. Such a data 
structure is commonly called the "scoreboard". The specifics of the 
scoreboard data structure are out of scope for this document (as long 
as the implementation can perform all functions required by this 
specification). 


Note that this document refers to keeping account of (marking) 
individual octets of data transferred across a TCP connection. A 
real-world implementation of the scoreboard would likely prefer to 
manage this data as sequence number ranges. The algorithms presented 
here allow this, but require arbitrary sequence number ranges to be 
marked as having been selectively acknowledged. 
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4 Processing and Acting Upon SACK Information 


For the purposes of the algorithm defined in this document the 
scoreboard SHOULD implement the following functions: 


Update (): 


Given the information provided in an ACK, each octet that is 
cumulatively ACKed or SACKed should be marked accordingly in the 
scoreboard data structure, and the total number of octets SACKed 
should be recorded. 


Note: SACK information is advisory and therefore SACKed data MUST 
NOT be removed from TCP’s retransmission buffer until the data is 
cumulatively acknowledged [RFC2018]. 


IsLost (SeqNum) : 


This routine returns whether the given sequence number is 
considered to be lost. The routine returns true when either 
DupThresh discontiguous SACKed sequences have arrived above 
'SeqNum’ or (DupThresh * SMSS) bytes with sequence numbers greater 
than ‘’SeqNum’ have been SACKed. Otherwise, the routine returns 
false. 


SetPipe (): 


This routine traverses the sequence space from HighACK to HighData 
and MUST set the "pipe" variable to an estimate of the number of 
octets that are currently in transit between the TCP sender and 
the TCP receiver. After initializing pipe to zero the following 
steps are taken for each octet ’S1’ in the sequence space between 
HighACK and HighData that has not been SACKed: 


(a) If IsLost (S1) returns false: 
Pipe is incremented by 1 octet. 


The effect of this condition is that pipe is incremented for 
packets that have not been SACKed and have not been determined 
to have been lost (i.e., those segments that are still assumed 
to be in the network). 


(bo) If S1 <= HighRxt: 


Pipe is incremented by 1 octet. 
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The effect of this condition is that pipe is incremented for 
the retransmission of the octet. 


Note that octets retransmitted without being considered lost are 
counted twice by the above mechanism. 


NextSeg (): 


This routine uses the scoreboard data structure maintained by the 
Update() function to determine what to transmit based on the SACK 
information that has arrived from the data receiver (and hence 
been marked in the scoreboard). NextSeg () MUST return the 
sequence number range of the next segment that is to be 
transmitted, per the following rules: 


(1) If there exists a smallest unSACKed sequence number ’S2’ that 
meets the following three criteria for determining loss, the 
sequence range of one segment of up to SMSS octets starting 
with S2 MUST be returned. 


(1.a) S2 is greater than HighRxt. 


(1.b) S2 is less than the highest octet covered by any 
received SACK. 


(l.c) IsLost (S2) returns true. 


(2) If no sequence number ’S2’ per rule (1) exists but there 
exists available unsent data and the receiver’s advertised 
window allows, the sequence range of one segment of up to SMSS 
octets of previously unsent data starting with sequence number 
HighData+l MUST be returned. 


(3) If the conditions for rules (1) and (2) fail, but there exists 
an unSACKed sequence number ’S3’ that meets the criteria for 
detecting loss given in steps (1.a) and (1.b) above 
(specifically excluding step (1l.c)) then one segment of up to 
SMSS octets starting with S3 MAY be returned. 


Note that rule (3) is a sort of retransmission "last resort". 
It allows for retransmission of sequence numbers even when the 
sender has less certainty a segment has been lost than as with 
rule (1). Retransmitting segments via rule (3) will help 
sustain TCP’s ACK clock and therefore can potentially help 
avoid retransmission timeouts. However, in sending these 
segments the sender has two copies of the same data considered 
to be in the network (and also in the Pipe estimate). When an 
ACK or SACK arrives covering this retransmitted segment, the 
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sender cannot be sure exactly how much data left the network 
(one of the two transmissions of the packet or both 
transmissions of the packet). Therefore the sender may 
underestimate Pipe by considering both segments to have left 
the network when it is possible that only one of the two has. 


We believe that the triggering of rule (3) will be rare and 
that the implications are likely limited to corner cases 
relative to the entire recovery algorithm. Therefore we leave 
the decision of whether or not to use rule (3) to 
implementors. 


(4) If the conditions for each of (1), (2), and (3) are not met, 
then NextSeg () MUST indicate failure, and no segment is 
returned. 


Note: The SACK-based loss recovery algorithm outlined in this 
document requires more computational resources than previous TCP loss 
recovery strategies. However, we believe the scoreboard data 
structure can be implemented in a reasonably efficient manner (both 
in terms of computation complexity and memory usage) in most TCP 
implementations. 


5 Algorithm Details 


Upon the receipt of any ACK containing SACK information, the 
scoreboard MUST be updated via the Update () routine. 


Upon the receipt of the first (DupThresh - 1) duplicate ACKs, the 
scoreboard is to be updated as normal. Note: The first and second 
duplicate ACKs can also be used to trigger the transmission of 
previously unsent segments using the Limited Transmit algorithm 
[RFC3042]. 


When a TCP sender receives the duplicate ACK corresponding to 
DupThresh ACKs, the scoreboard MUST be updated with the new SACK 
information (via Update ()). If no previous loss event has occurred 
on the connection or the cumulative acknowledgment point is beyond 
the last value of RecoveryPoint, a loss recovery phase SHOULD be 
initiated, per the fast retransmit algorithm outlined in [RFC2581]. 
The following steps MUST be taken: 


(1) RecoveryPoint = HighData 


When the TCP sender receives a cumulative ACK for this data octet 
the loss recovery phase is terminated. 
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(2) 


(5) 


ssthresh = cwnd = (FlightSize / 2) 


The congestion window (cwnd) and slow start threshold (ssthresh) 
are reduced to half of FlightSize per [RFC2581]. 


Retransmit the first data segment presumed dropped -- the segment 
starting with sequence number HighACK + 1. To prevent repeated 
retransmission of the same data, set HighRxt to the highest 
sequence number in the retransmitted segment. 


Run SetPipe () 


Set a "pipe" variable to the number of outstanding octets 
currently "in the pipe"; this is the data which has been sent by 
the TCP sender but for which no cumulative or selective 
acknowledgment has been received and the data has not been 
determined to have been dropped in the network. It is assumed 
that the data is still traversing the network path. 


In order to take advantage of potential additional available 
cwnd, proceed to step (C) below. 


Once a TCP is in the loss recovery phase the following procedure MUST 
be used for each arriving ACK: 


(A) 


An incoming cumulative ACK for a sequence number greater than 
RecoveryPoint signals the end of loss recovery and the loss 
recovery phase MUST be terminated. Any information contained in 
the scoreboard for sequence numbers greater than the new value of 
HighACK SHOULD NOT be cleared when leaving the loss recovery 
phase. 


Upon receipt of an ACK that does not cover RecoveryPoint the 
following actions MUST be taken: 


(B.1) Use Update () to record the new SACK information conveyed 
by the incoming ACK. 


(B.2) Use SetPipe () to re-calculate the number of octets still 
in the network. 


If cwnd - pipe >= 1 SMSS the sender SHOULD transmit one or more 
segments as follows: 


(C.1) The scoreboard MUST be queried via NextSeg () for the 
sequence number range of the next segment to transmit (if any), 
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and the given segment sent. If NextSeg () returns failure (no 
data to send) return without sending anything (i.e., terminate 
steps C.1 -- C.5). 


(C.2) If any of the data octets sent in (C.1) are below HighData, 
HighRxt MUST be set to the highest sequence number of the 
retransmitted segment. 


(C.3) If any of the data octets sent in (C.1) are above HighData, 
HighData must be updated to reflect the transmission of 
previously unsent data. 


(C.4) The estimate of the amount of data outstanding in the 
network must be updated by incrementing pipe by the number of 
octets transmitted in (C.1). 


(C.5) If cwnd - pipe >= 1 SMSS, return to (C.1) 
5.1 Retransmission Timeouts 


In order to avoid memory deadlocks, the TCP receiver is allowed to 
discard data that has already been selectively acknowledged. As a 
result, [RFC2018] suggests that a TCP sender SHOULD expunge the SACK 
information gathered from a receiver upon a retransmission timeout 
"since the timeout might indicate that the data receiver has 
reneged." Additionally, a TCP sender MUST "ignore prior SACK 
information in determining which data to retransmit." However, a 
SACK TCP sender SHOULD still use all SACK information made available 
during the slow start phase of loss recovery following an RTO. 


If an RTO occurs during loss recovery as specified in this document, 
RecoveryPoint MUST be set to HighData. Further, the new value of 
RecoveryPoint MUST be preserved and the loss recovery algorithm 
outlined in this document MUST be terminated. In addition, a new 
recovery phase (as described in section 5) MUST NOT be initiated 
until HighACK is greater than or equal to the new value of 
RecoveryPoint. 


As described in Sections 4 and 5, Update () SHOULD continue to be 
used appropriately upon receipt of ACKs. This will allow the slow 
start recovery period to benefit from all available information 
provided by the receiver, despite the fact that SACK information was 
expunged due to the RTO. 


If there are segments missing from the receiver’s buffer following 
processing of the retransmitted segment, the corresponding ACK will 
contain SACK information. In this case, a TCP sender SHOULD use this 
SACK information when determining what data should be sent in each 
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segment of the slow start. The exact algorithm for this selection is 
not specified in this document (specifically NextSeg () is 
inappropriate during slow start after an RTO). A relatively 
straightforward approach to "filling in" the sequence space reported 
as missing should be a reasonable approach. 


6 Managing the RTO Timer 


The standard TCP RTO estimator is defined in [RFC2988]. Due to the 
fact that the SACK algorithm in this document can have an impact on 
the behavior of the estimator, implementers may wish to consider how 
the timer is managed. [RFC2988] calls for the RTO timer to be 
re-armed each time an ACK arrives that advances the cumulative ACK 
point. Because the algorithm presented in this document can keep the 
ACK clock going through a fairly significant loss event, 
(comparatively longer than the algorithm described in [RFC2581]), on 
some networks the loss event could last longer than the RTO. In this 
case the RTO timer would expire prematurely and a segment that need 
not be retransmitted would be resent. 


Therefore we give implementers the latitude to use the standard 
[RFC2988] style RTO management or, optionally, a more careful variant 
that re-arms the RTO timer on each retransmission that is sent during 
recovery MAY be used. This provides a more conservative timer than 
specified in [RFC2988], and so may not always be an attractive 
alternative. However, in some cases it may prevent needless 
retransmissions, go-back-N transmission and further reduction of the 
congestion window. 


7 Research 


The algorithm specified in this document is analyzed in [FF96], which 
shows that the above algorithm is effective in reducing transfer time 
over standard TCP Reno [RFC2581] when multiple segments are dropped 
from a window of data (especially as the number of drops increases). 
[AHKO97] shows that the algorithm defined in this document can 
greatly improve throughput in connections traversing satellite 
channels. 


8 Security Considerations 


The algorithm presented in this paper shares security considerations 
with [RFC2581]. A key difference is that an algorithm based on SACKs 
is more robust against attackers forging duplicate ACKs to force the 
TCP sender to reduce cwnd. With SACKs, TCP senders have an 
additional check on whether or not a particular ACK is legitimate. 
While not fool-proof, SACK does provide some amount of protection in 
this area. 
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